Skip to content

feat: provide Request instances in skipped request callbacks#1927

Open
lorenz-lb wants to merge 1 commit into
apify:masterfrom
lorenz-lb:master
Open

feat: provide Request instances in skipped request callbacks#1927
lorenz-lb wants to merge 1 commit into
apify:masterfrom
lorenz-lb:master

Conversation

@lorenz-lb

@lorenz-lb lorenz-lb commented May 31, 2026

Copy link
Copy Markdown

Description

This PR changes the skipped request handler to receive a Request object instead of only the URL str.

Skipped URLs are now converted into Request objects before the callback is invoked. This ensures that request metadata remains available for skipped requests, including request.user_data and other request attributes.

Issues

  • N/A

Testing

  • Added/updated unit tests for skipped request handlers.
  • Verified that skipped request handlers receive the original Request object.
  • Ran uv run poe check-code.

Checklist

  • CI passed

Future improvements

While working on this change, I noticed that many parts of the internal processing pipeline operate on str | Request unions.

A possible future improvement could be to standardize on Request objects internally and only accept str | Request at the public API boundary. URLs could then be wrapped into Request instances immediately, allowing the rest of the codebase to operate exclusively on Request objects.

This could simplify typing, reduce repeated union handling, and make request metadata consistently available throughout the pipeline. However, such a change would require broader refactoring and additional testing, so it is outside the scope of this PR.

@janbuchar janbuchar self-requested a review June 3, 2026 19:07
ErrorHandler = Callable[[TCrawlingContext, Exception], Awaitable[Request | None]]
FailedRequestHandler = Callable[[TCrawlingContext, Exception], Awaitable[None]]
SkippedRequestCallback = Callable[[str, SkippedReason], Awaitable[None]]
SkippedRequestCallback = Callable[[Request, SkippedReason], Awaitable[None]]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change to a public API. Existing callbacks like async def cb(url: str, reason) will now receive a Request; string usage like url.startswith(...) will crash. It should work with both str and Request.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you can start sending Request instances into existing callbacks that expect str without breaking BC without runtime type inspection — do we really want to go that way?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you can start sending Request instances into existing callbacks that expect str without breaking BC without runtime type inspection — do we really want to go that way?

You're right. Well, this may be a v2.0 material.

if self._on_skipped_request:
try:
await self._on_skipped_request(url, reason)
await self._on_skipped_request(request, reason)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now forwards request straight to the callback, but add_requests() (just below, ~line 841) wasn't updated — it still builds skipped from the original Sequence[str | Request] and passes a possibly-str item here:

for request in requests:
    check_url = request.url if isinstance(request, Request) else request
    if await self._is_allowed_based_on_robots_txt_file(check_url):
        allowed_requests.append(request)
    else:
        skipped.append(request)  # <- can be a plain str

So with respect_robots_txt_file=True, await crawler.add_requests(['https://disallowed/...']) on a robots-disallowed URL delivers a str to the callback, and request.url (as in the docs example) raises AttributeErrorUserDefinedErrorHandlerError.

Suggest normalizing str → Request once at the choke point (here, or in add_requests) instead of only in the two extract_links impls. That also makes the isinstance(request, Request) guard a few lines up dead code (it now contradicts the request: Request annotation). This path also has no test coverage.

requests = list[Request]()
skipped = list[Request]()

def create_request(request_options: RequestOptions) -> Request | None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create_request is now duplicated verbatim (including the multi-line debug message) with PlaywrightCrawler.extract_links (_playwright_crawler.py:464). Both crawlers extend BasicCrawler, so this looks like a good candidate for a single shared helper in crawlee/_utils (taking a logger) to keep the two copies from drifting.

context.log.debug(
f'Skipping URL "{request_options["url"]}" due to invalid format: {exc}. '
'This may be caused by a malformed URL or unsupported URL scheme. '
'Please ensure the URL is correct and retry.'

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that this helper is also used for robots-skipped, auto-discovered links (not just user enqueues), the "Please ensure the URL is correct and retry." wording is misleading — the operator never submitted this URL and there's nothing to retry. Consider a neutral message, or a separate one for the skip path.

else context.request.loaded_url or context.request.url
)
links_iterator = to_absolute_url_iterator(base_url, links_iterator, logger=context.log)
skipped_iterator = iter([])

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: skipped_iterator = iter([]) followed by a conditional reassignment inside if robots_txt_file: reads a little awkwardly (and iter([]) infers Iterator[Never]). The previous if/else, or guarding the skipped-building loop under the same if robots_txt_file:, is clearer.

request = create_request(request_options)

if request is not None:
skipped.append(request)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things about this skipped-building loop:

  1. Behavior change / silent drop: previously every robots-disallowed URL reached the callback as a raw string; now they go through Request.from_url, which enforces http/https (validate_http_url). A disallowed-but-non-http(s) or malformed-but-absolute URL (one that passes to_absolute_url_iterator but fails from_url) is now silently dropped and the skip callback never fires for it. Intended?

  2. Consistency: these skipped requests are built directly and don't pass through transform_request_function, while the enqueued ones do. So a skipped Request carries the base label/user_data/enqueue_strategy but not the user's per-request transform — inconsistent with enqueued requests from the same links.

requests = list[Request]()
skipped = list[Request]()

def create_request(request_options: RequestOptions) -> Request | None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the HTTP crawler: this create_request is a verbatim duplicate (candidate for a shared helper), and the skipped-building loop below has the same silent-drop and transform_request_function-not-applied behavior. See the comments on _abstract_http_crawler.py.


requests = [call.args[0] for call in skip.call_args_list]

all(isinstance(request, Request) for request in requests)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No assert?


requests = [call.args[0] for call in skip.call_args_list]

all(isinstance(request, Request) for request in requests)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No assert?


requests = [call.args[0] for call in skip.call_args_list]

all(isinstance(request, Request) for request in requests)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No assert?

@lorenz-lb

Copy link
Copy Markdown
Author

Thanks for the review! Sorry for some obvious mistakes, I started to use crawlee only recently, clearly skill issue on my side. I'll incorporate your feedback very very soon!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants